Homework 2

Author

Liam Smith

Task 1

We are going to return to the table of the top 100 wrestlers: https://www.cagematch.net/?id=2&view=statistics. Specifically, you are going to get the ratings/comments tables for each wrestler.

# Import necessary libraries
import requests
import pandas as pd
import re
import plotly.express as px
from joblib import dump, load
from langdetect import detect
from bs4 import BeautifulSoup
# First get IDs for each wrestler
link = 'https://www.cagematch.net/?id=2&view=statistics'
top_wrestlers = requests.get(link)
top_wrestlers_soup = BeautifulSoup(top_wrestlers.content, 'html.parser')
top_wrestlers = top_wrestlers_soup.select(
  '.TRow1, .TRow2'
)
len(top_wrestlers)

# Empty dataframe for their ID
wrestler_ids = pd.DataFrame(columns = ['Name', 'ID'])
for wrestler in top_wrestlers:
  # Find and pull out ID
  id_cols = wrestler.find_all('td')
  wrestler_id = id_cols[1].find('a')['href']
  wrestler_id = re.search(r'(?<=nr=)\d+', wrestler_id).group(0)
  wrestler_id = int(wrestler_id)
  # Find and pull out name
  name_cols = wrestler.find_all('td')
  name = name_cols[1].text.strip()
  wrestler_ids = pd.concat([wrestler_ids, pd.DataFrame({'Name': [name], 'ID': [wrestler_id]})])

# Clean up dataframe
wrestler_ids.index = range(len(wrestler_ids))
# Go get comments and ratings
wrestler_df = pd.DataFrame(columns = ['Name', 'Rating', 'Comment'])
for id in wrestler_ids['ID']:
  link = f'https://www.cagematch.net/?id=2&nr={id}&page=99'
  comments = requests.get(link)
  comment_soup = BeautifulSoup(comments.content, 'html.parser')
  wrestler = comment_soup.select('.CommentContents')
  wrestler = [wrestler[i].getText() for i in range(len(wrestler))]
  wrestler = pd.DataFrame(wrestler, columns = ['Comment'])
  for index, row in wrestler.iterrows():
    comment = row['Comment']
    rating = re.search(r'^\[\d+\.?\d?\]', comment)
    if rating:
      rating = rating.group(0)
    else:
      rating = 'N/A'
    comment = re.sub(r'^\[\d+\.?\d?\]', '', comment)
    name = wrestler_ids[wrestler_ids['ID'] == id]['Name'].values[0]
    wrestler_df = pd.concat([wrestler_df, pd.DataFrame({'Name': [name], 'Rating': [rating], 'Comment': [comment]})], ignore_index=True)

Task 2

Perform any form of sentiment analysis. What is the relationship between a reviewer’s sentiment and their rating?

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_lg')
nlp.add_pipe('spacytextblob')
sentiment_df = wrestler_df.copy()
#sentiment_df['Sentiment'] = sentiment_df['Comment'].apply(lambda x: nlp(x)._.blob.polarity)

#dump(sentiment_df, 'sentiment_df.joblib')
# Load sentiment_df
import joblib
sentiment_df = joblib.load('G:/My Drive/MSBA/Unstructured/unstructured-notes/hw2/sentiment_df.joblib')

# Remove links
sentiment_df = sentiment_df[~sentiment_df['Comment'].str.contains('http')]

# Detect language and remove rows where language is not English
sentiment_df['Language'] = sentiment_df['Comment'].apply(detect)
sentiment_df = sentiment_df[sentiment_df['Language'] == 'en']
sentiment_df.drop(columns=['Language'], inplace=True)

# Remove rows where rating is N/A
sentiment_df = sentiment_df[sentiment_df['Rating'] != 'N/A']
# Remove brackets from rating
sentiment_df['Rating'] = sentiment_df['Rating'].str.replace('[', '').str.replace(']', '')
# Convert rating to int
sentiment_df['Rating'] = sentiment_df['Rating'].astype(float)
'''
import plotly.express as px
fig = px.scatter(sentiment_df, x = 'Rating', y = 'Sentiment', color = 'Name')
fig.update_layout(showlegend=False)
fig.update_layout(title = 'Sentiment vs Rating')
fig.update_layout(template = 'plotly_dark')
fig.show()
'''
# Commented out because box plot is better since there is only certain values of Rating
"\nimport plotly.express as px\nfig = px.scatter(sentiment_df, x = 'Rating', y = 'Sentiment', color = 'Name')\nfig.update_layout(showlegend=False)\nfig.update_layout(title = 'Sentiment vs Rating')\nfig.update_layout(template = 'plotly_dark')\nfig.show()\n"
fig2 = px.box(sentiment_df, x = 'Rating', y = 'Sentiment')
fig2.update_layout(title = 'Sentiment vs Rating')
fig2.update_layout(template = 'plotly_dark')
fig2.update_layout(showlegend=False)
fig2.show()

Task 3

Perform any type of topic modeling on the comments. What are the main topics of the comments? How can you use those topics to understand what people value?

# Tried it...
'''
from deep_translator import GoogleTranslator

# Define function to translate
def translate_to_english(text):
  translator = GoogleTranslator(source='auto', target='en')
  return translator.translate(text)

# TEXT IS TOO LONG. MUST BE LESS THAN 500 CHARACTERS.
# Apply function to comments for first 1000 rows
sentiment_df['Comment'] = sentiment_df['Comment'].apply(lambda x: translate_to_english(x))
'''
"\nfrom deep_translator import GoogleTranslator\n\n# Define function to translate\ndef translate_to_english(text):\n  translator = GoogleTranslator(source='auto', target='en')\n  return translator.translate(text)\n\n# TEXT IS TOO LONG. MUST BE LESS THAN 500 CHARACTERS.\n# Apply function to comments for first 1000 rows\nsentiment_df['Comment'] = sentiment_df['Comment'].apply(lambda x: translate_to_english(x))\n"
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Clustered Tfidf
ctfidf_model = ClassTfidfTransformer(
  reduce_frequent_words = True
)

# Create BERTopic model
#topic_model = BERTopic(ctfidf_model = ctfidf_model)

# Fit model
#topics, probs = topic_model.fit_transform(sentiment_df['Comment'].to_list())

# Save model
#dump(topic_model, 'topic_model.joblib')
#dump(topics, 'topics.joblib')
#dump(probs, 'probs.joblib')
# Load model
import joblib
topic_model = joblib.load('G:/My Drive/MSBA/Unstructured/unstructured-notes/hw2/topic_model.joblib')

topic_model.get_topic(0)
topic_model.get_topic(1)
topic_model.get_topic(2)
topic_model.get_topic_info()

The top topic contains words that you would expect to see in positive comments about a wrestler. Words like “legend”, “greatest”, and “technical” used to talk about how one feels about a wrestler. These positive words being the top topic, along with the sentiment analysis being mostly positive, suggests to me that most people are only coming to comment on Cagematch if they feel very passionately about a wrestler. The next four topics are all about a specific wrestler: CM Punk, Chris Jericho, the Undertaker, and AJ Styles. This suggests a high popularity of these wrestlers where a lot of people are coming to comment on their page. The top words in these topics also indicate that people closely relate wrestlers with others, more specifically, they characterize wrestlers by their rivalries. CM Punk had one of the most iconic rivalries in WWE history with John Cena and this is shown through the frequency of the words “john” and “cena” in the topic. Another frequent word in CM Punk’s topic is “pipebomb”, which was a promo in which CM Punk criticized the WWE. This shows that big moments stick with fans and impact their view of a wrestler. Professional wrestling is undoubtedly a form of entertainment unlike any other. Cagematch tells us that the love it so much due to the greatness of certain wrestlers, the rivalries they have, and the moments they create.